Link to the paper: [link]
1. Discrete-Time: Markov Chains and Time Reversal (Recap)

Consider a data distribution with positive density $p_{\text{data}}$, a positive prior density $p_{\text{prior}}$, and a Markov chain with initial density $p_0 = p_{\text{data}}$ on $\mathbb{R}^d$ evolving according to positive transition densities $p_{k+1|k}$ for $k \in \{0, \dots, N-1\}$. By the Markov property, for any $x_{0:N} = \{x_k\}_{k=0}^N \in \mathcal{X} = (\mathbb{R}^d)^{N+1}$, the joint density can be expressed as:
\begin{equation}
p(x_{0:N}) = p_0(x_0) \prod_{k=0}^{N-1} p_{k+1|k}(x_{k+1} | x_k)
\end{equation}

The joint density also admits the backward decomposition:
\begin{equation}
p(x_{0:N}) = p_N(x_N) \prod_{k=0}^{N-1} p_{k|k+1}(x_k | x_{k+1})
\end{equation}

where $p_k(x_k) = \int p_{k|k-1}(x_k | x_{k-1}) p_{k-1}(x_{k-1}) \, dx_{k-1}$ is the marginal density at step $k \geq 1$.
☝
Why is the backward decomposition correct?

Proof by induction:

When $N = 1$, Equation (2) is just Bayes' rule: $p(x_0, x_1) = p_0(x_0) p_{1|0}(x_1 | x_0) = p_1(x_1) p_{0|1}(x_0 | x_1)$.

Assume the equation holds for $N = M$, $M \in \mathbb{N}_+$; we then show it also holds for $N = M+1$.
\begin{align*}
p(x_{0:M+1}) &= p(x_{0:M}) \, p(x_{M+1} | x_{0:M}) \\
&= \left[ p_M(x_M) \prod_{k=0}^{M-1} p_{k|k+1}(x_k | x_{k+1}) \right] p(x_{M+1} | x_{0:M}) \quad \text{(by assumption)} \\
&= \left[ p_M(x_M) \prod_{k=0}^{M-1} p_{k|k+1}(x_k | x_{k+1}) \right] p_{M+1|M}(x_{M+1} | x_M) \quad \text{(by the Markov property)} \\
&= p_{M+1}(x_{M+1}) \, p_{M|M+1}(x_M | x_{M+1}) \prod_{k=0}^{M-1} p_{k|k+1}(x_k | x_{k+1}) \quad \text{(by Bayes' rule)} \\
&= p_{M+1}(x_{M+1}) \prod_{k=0}^{M} p_{k|k+1}(x_k | x_{k+1})
\end{align*}

End of Proof
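As a quick numerical sanity check of Equations (1) and (2), here is a toy example; the two-state chain and its transition matrices below are arbitrary illustrative choices:

```python
import numpy as np

# Toy 2-state chain with N = 2 (arbitrary illustrative numbers).
p0 = np.array([0.3, 0.7])                     # initial marginal p_0
P = [np.array([[0.9, 0.1], [0.2, 0.8]]),      # p_{1|0}(x_1 | x_0)
     np.array([[0.6, 0.4], [0.5, 0.5]])]      # p_{2|1}(x_2 | x_1)

# Marginals p_1, p_2 by propagating p_0 through the forward kernels.
p1 = p0 @ P[0]
p2 = p1 @ P[1]

# Backward kernels p_{k|k+1}(x_k | x_{k+1}) from Bayes' rule.
B0 = P[0] * p0[:, None] / p1[None, :]         # p_{0|1}
B1 = P[1] * p1[:, None] / p2[None, :]         # p_{1|2}

# Forward factorization (1) and backward factorization (2) agree pointwise.
for x0 in range(2):
    for x1 in range(2):
        for x2 in range(2):
            forward = p0[x0] * P[0][x0, x1] * P[1][x1, x2]
            backward = p2[x2] * B1[x1, x2] * B0[x0, x1]
            assert np.isclose(forward, backward)
print("forward and backward factorizations agree")
```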
For the purpose of generative modeling, the transition densities are chosen such that $p_N(x_N) = \int p(x_{0:N}) \, dx_{0:N-1} \approx p_{\text{prior}}(x_N)$ for large $N$, where $p_{\text{prior}}$ is an easy-to-sample prior density. To sample approximately from $p_{\text{data}}$, one may use ancestral sampling with Equation (2), i.e. first sample $X_N \sim p_{\text{prior}}$ followed by $X_k \sim p_{k|k+1}(\cdot | X_{k+1})$ for $k \in \{N-1, \dots, 0\}$.
Equation (2) cannot be simulated exactly, but may be approximated if we consider a forward transition density of the form:
\begin{equation}
p_{k+1|k}(x_{k+1} | x_k) = \mathcal{N}(x_{k+1}; x_k + \gamma_{k+1} f(x_k), 2\gamma_{k+1} \bold{I})
\end{equation}

with drift $f : \mathbb{R}^d \rightarrow \mathbb{R}^d$ and stepsize $\gamma_{k+1} > 0$.
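For concreteness, here is a minimal sketch of simulating this forward chain; the drift `f` and the stepsizes `gammas` are placeholders for whatever choices one makes in practice:

```python
import torch

def forward_chain(x0, f, gammas):
    """Simulate X_0, ..., X_N from the forward kernel in Equation (3) (a sketch).

    `f` is the drift and `gammas[k]` the stepsize gamma_k (index 0 unused);
    both are placeholders for whatever choices one makes in practice.
    """
    xs = [x0]
    x = x0
    for k in range(1, len(gammas)):
        z = torch.randn_like(x)                                   # Z_k ~ N(0, I)
        x = x + gammas[k] * f(x) + (2 * gammas[k]) ** 0.5 * z     # Equation (3)
        xs.append(x)
    return xs
```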
The backward kernel in Equation (2) can then be approximated as follows:

\begin{equation}
\begin{aligned}
p_{k|k+1}(x_k | x_{k+1})
&= p_{k+1|k}(x_{k+1} | x_k) \exp[\log p_k(x_k) - \log p_{k+1}(x_{k+1})] \\
&\approx \mathcal{N}(x_k; x_{k+1} - \gamma_{k+1} f(x_{k+1}) + 2\gamma_{k+1} \nabla \log p_{k+1}(x_{k+1}), 2\gamma_{k+1} \bold{I})
\end{aligned}
\end{equation}

using that $p_k \approx p_{k+1}$, a Taylor expansion of $\log p_{k+1}$ at $x_{k+1}$, and $f(x_k) \approx f(x_{k+1})$.
☝
How to derive the approximation?
First,
\begin{align*}
p_{k|k+1}(x_k | x_{k+1}) &= p_{k+1|k}(x_{k+1} | x_k) \frac{p_k(x_k)}{p_{k+1}(x_{k+1})} \\
&= p_{k+1|k}(x_{k+1} | x_k) \exp\left( \log \frac{p_k(x_k)}{p_{k+1}(x_{k+1})} \right) \\
&= p_{k+1|k}(x_{k+1} | x_k) \exp[\log p_k(x_k) - \log p_{k+1}(x_{k+1})]
\end{align*}

By Taylor expansion of $\log p_{k+1}(x)$ at $x = x_{k+1}$:
$$\log p_{k+1}(x) \approx \log p_{k+1}(x_{k+1}) + \nabla \log p_{k+1}(x_{k+1})^T (x - x_{k+1})$$

Since $p_k \approx p_{k+1}$, we have $\log p_k(x_k) \approx \log p_{k+1}(x_k)$. Plugging $x = x_k$ into the expansion above:

$$\log p_k(x_k) - \log p_{k+1}(x_{k+1}) \approx \nabla \log p_{k+1}(x_{k+1})^T (x_k - x_{k+1})$$

Since $p_{k+1|k}(x_{k+1} | x_k)$ is Gaussian, we have:
\begin{align*}
&p_{k+1|k}(x_{k+1} | x_k) \exp[\log p_k(x_k) - \log p_{k+1}(x_{k+1})] \\
\propto\ & \exp\left( -\frac{1}{4\gamma_{k+1}} \| x_{k+1} - x_k - \gamma_{k+1} f(x_{k+1}) \|^2 \right) \cdot \exp\left( \nabla \log p_{k+1}(x_{k+1})^T (x_k - x_{k+1}) \right) \quad \text{(using } f(x_k) \approx f(x_{k+1})\text{)} \\
\propto\ & \exp\left( -\frac{1}{4\gamma_{k+1}} \left[ \|x_k\|^2 - 2(x_{k+1} - \gamma_{k+1} f(x_{k+1}))^T x_k - 4\gamma_{k+1} \nabla \log p_{k+1}(x_{k+1})^T (x_k - x_{k+1}) \right] \right) \\
\propto\ & \exp\left( -\frac{1}{4\gamma_{k+1}} \left[ \|x_k\|^2 - 2 x_k^T \left( x_{k+1} - \gamma_{k+1} f(x_{k+1}) + 2\gamma_{k+1} \nabla \log p_{k+1}(x_{k+1}) \right) \right] \right)
\end{align*}

Since a quadratic exponent in $x_k$ characterizes a Gaussian density, we conclude that:
$$p_{k|k+1}(x_k | x_{k+1}) \approx \mathcal{N}(x_k; x_{k+1} - \gamma_{k+1} f(x_{k+1}) + 2\gamma_{k+1} \nabla \log p_{k+1}(x_{k+1}), 2\gamma_{k+1} \bold{I})$$

In practice, the approximation holds if $\|x_{k+1} - x_k\|$ is small, which is ensured by choosing $\gamma_{k+1}$ small enough.
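To see how good this approximation is in a fully explicit case, consider (as an illustrative assumption, not taken from the paper) $f = 0$ and $p_k = \mathcal{N}(0, \sigma^2 \bold{I})$, so that $p_{k+1} = \mathcal{N}(0, (\sigma^2 + 2\gamma_{k+1})\bold{I})$. Then the exact reverse kernel and the approximation above are

\begin{align*}
p_{k|k+1}(x_k | x_{k+1}) &= \mathcal{N}\left(x_k;\ \tfrac{\sigma^2}{\sigma^2 + 2\gamma_{k+1}} x_{k+1},\ \tfrac{2\gamma_{k+1}\sigma^2}{\sigma^2 + 2\gamma_{k+1}} \bold{I}\right) \quad \text{(exact, by Gaussian conjugacy)}, \\
p_{k|k+1}(x_k | x_{k+1}) &\approx \mathcal{N}\left(x_k;\ x_{k+1} + 2\gamma_{k+1} \nabla \log p_{k+1}(x_{k+1}),\ 2\gamma_{k+1}\bold{I}\right) = \mathcal{N}\left(x_k;\ \tfrac{\sigma^2}{\sigma^2 + 2\gamma_{k+1}} x_{k+1},\ 2\gamma_{k+1}\bold{I}\right).
\end{align*}

The means agree exactly, and the variances differ only by a factor $\sigma^2 / (\sigma^2 + 2\gamma_{k+1}) = 1 - O(\gamma_{k+1})$, consistent with the approximation improving as $\gamma_{k+1} \rightarrow 0$.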
$\nabla \log p_{k+1}$ is not available in closed form, but one can obtain an approximation of it using denoising score matching. We assume that the conditional density $p_{k+1|0}(x_{k+1} | x_0)$ is available analytically (e.g. it is Gaussian, so its log-gradient is explicit). One can show that $\nabla \log p_{k+1}(x_{k+1}) = \mathbb{E}_{p_{0|k+1}}\left[ \nabla_{x_{k+1}} \log p_{k+1|0}(x_{k+1} | X_0) \right]$.
☝
Proof of the equality.
First, a fact about the gradient:
\begin{align*}
\nabla \log f(x) &= \frac{\nabla f(x)}{f(x)} \\
f(x) \nabla \log f(x) &= \nabla f(x)
\end{align*}

Since $p_{k+1}(x_{k+1}) = \int p_0(x_0) p_{k+1|0}(x_{k+1} | x_0) \, dx_0$, we have:
\begin{align*}
\nabla \log p_{k+1}(x_{k+1})
&= \frac{\nabla_{x_{k+1}} p_{k+1}(x_{k+1})}{p_{k+1}(x_{k+1})} \\
&= \int \frac{p_0(x_0)}{p_{k+1}(x_{k+1})} \cdot \nabla_{x_{k+1}} p_{k+1|0}(x_{k+1} | x_0) \, dx_0 \\
&= \int \frac{p_0(x_0) p_{k+1|0}(x_{k+1} | x_0)}{p_{k+1}(x_{k+1})} \cdot \nabla_{x_{k+1}} \log p_{k+1|0}(x_{k+1} | x_0) \, dx_0 \\
&= \int p_{0|k+1}(x_0 | x_{k+1}) \cdot \nabla_{x_{k+1}} \log p_{k+1|0}(x_{k+1} | x_0) \, dx_0 \\
&= \mathbb{E}_{p_{0|k+1}}\left[ \nabla_{x_{k+1}} \log p_{k+1|0}(x_{k+1} | X_0) \right]
\end{align*}

Therefore we can formulate score estimation as a regression problem and use a flexible class of functions, e.g. neural networks, to parameterize an approximation $s_{\theta^*}(k, x_k) \approx \nabla \log p_k(x_k)$ such that:
$$\theta^* = \argmin_{\theta} \sum_{k=1}^N \mathbb{E}_{p_{0,k}}\left[ \| s_{\theta}(k, X_k) - \nabla_{x_k} \log p_{k|0}(X_k | X_0) \|^2 \right]$$

where $p_{0,k} = p_0(x_0) p_{k|0}(x_k | x_0)$. This can be done by drawing a sample $x_0$ from the dataset and using $p_{k|0}$ to obtain the corresponding $x_k$.
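A minimal sketch of this training objective, assuming a Gaussian $p_{k|0}(x_k | x_0) = \mathcal{N}(x_k; a_k x_0, b_k^2 \bold{I})$; the coefficient arrays `a`, `b` and the network `score_net` are placeholders, and the sum over $k$ is replaced by a uniform random draw over steps, as is common in practice:

```python
import torch

def dsm_loss(score_net, x0, a, b, N):
    """Denoising score-matching loss (a minimal sketch).

    Assumes p_{k|0}(x_k | x_0) = N(x_k; a[k] * x_0, b[k]^2 * I), so that
    grad_{x_k} log p_{k|0}(x_k | x_0) = -(x_k - a[k] * x_0) / b[k]^2.
    `score_net(k, x)`, `a`, `b` are placeholders (a, b are 1-D tensors of length N+1).
    """
    batch = x0.shape[0]
    k = torch.randint(1, N + 1, (batch,))            # one random step per sample
    ak, bk = a[k].view(-1, 1), b[k].view(-1, 1)
    eps = torch.randn_like(x0)
    xk = ak * x0 + bk * eps                          # x_k ~ p_{k|0}(. | x_0)
    target = -(xk - ak * x0) / bk**2                 # = -eps / b_k
    return ((score_net(k, xk) - target) ** 2).sum(dim=-1).mean()
```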
If $p_{k|0}$ is not available, one can instead use $\theta^* = \argmin_{\theta} \sum_{k=1}^N \mathbb{E}_{p_{k-1,k}}\left[ \| s_{\theta}(k, X_k) - \nabla_{x_k} \log p_{k|k-1}(X_k | X_{k-1}) \|^2 \right]$.
In summary, score-based generative modeling involves first estimating the score function $s_{\theta^*}$ from noisy data, and then sampling $X_0$ by drawing $X_N \sim p_{\text{prior}}$ and running ancestral sampling with the approximation in Equation (4), i.e.
\begin{equation}
X_k = X_{k+1} - \gamma_{k+1} f(X_{k+1}) + 2\gamma_{k+1} s_{\theta^*}(k+1, X_{k+1}) + \sqrt{2\gamma_{k+1}} Z_{k+1}
\end{equation}

where $Z_{k+1} \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \bold{I})$.
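A minimal sketch of this sampler; `score_net`, the drift `f`, the stepsizes `gammas`, and the Gaussian prior with standard deviation `prior_std` are all placeholders/assumptions:

```python
import torch

@torch.no_grad()
def ancestral_sample(score_net, f, gammas, shape, prior_std=1.0):
    """Reverse-time ancestral sampling via Equation (5) (a minimal sketch).

    `score_net(k, x)` and the drift `f` are placeholders; `gammas` holds the
    stepsizes gamma_1, ..., gamma_N (index 0 unused), and p_prior is assumed
    to be N(0, prior_std^2 * I).
    """
    N = len(gammas) - 1
    x = prior_std * torch.randn(shape)                              # X_N ~ p_prior
    for k in range(N - 1, -1, -1):                                  # k = N-1, ..., 0
        g = gammas[k + 1]
        z = torch.randn(shape)                                      # Z_{k+1} ~ N(0, I)
        x = x - g * f(x) + 2 * g * score_net(k + 1, x) + (2 * g) ** 0.5 * z
    return x                                                        # approximate sample from p_data
```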
2. Continuous-Time: SDEs, Reverse-Time SDEs and Theoretical Results

The Markov chain with the kernel in Equation (3) corresponds to an Euler–Maruyama discretization of $(\bold{X}_t)_{t \in [0, T]}$, solving the following SDE:
\begin{equation}
d\bold{X}_t = f(\bold{X}_t) \, dt + \sqrt{2} \, d\bold{B}_t, \qquad \bold{X}_0 \sim p_0 = p_{\text{data}}
\end{equation}

where $(\bold{B}_t)_{t \in [0, T]}$ is a Brownian motion and $f : \mathbb{R}^d \rightarrow \mathbb{R}^d$ is regular enough so that solutions exist.
☝
Question on Equation (6)
If we strictly follow Equation (3), it may seem the SDE should be

$$d\bold{X}_t = \gamma_{t+1} f(\bold{X}_t) \, dt + \sqrt{2\gamma_{t+1}} \, d\bold{B}_t$$

whose discretization gives

$$\bold{X}_{t+1} = \bold{X}_t + \gamma_{t+1} f(\bold{X}_t) + \sqrt{2\gamma_{t+1}} \, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \bold{I}),$$

i.e. $\bold{X}_{t+1} | \bold{X}_t \sim \mathcal{N}(\bold{X}_t + \gamma_{t+1} f(\bold{X}_t), 2\gamma_{t+1} \bold{I})$. The resolution is that $\gamma_{k+1}$ plays the role of the time step of the Euler–Maruyama scheme: discretizing Equation (6) over a step of length $\gamma_{k+1}$ yields exactly the kernel of Equation (3), so no $\gamma$ appears in the continuous-time SDE itself.

Under some conditions on $f$, the reverse-time process $(\bold{Y}_t)_{t \in [0, T]} = (\bold{X}_{T-t})_{t \in [0, T]}$ satisfies
\begin{equation}
d\bold{Y}_t = \{ -f(\bold{Y}_t) + 2 \nabla \log p_{T-t}(\bold{Y}_t) \} \, dt + \sqrt{2} \, d\bold{B}_t
\end{equation}

with initialization $\bold{Y}_0 \sim p_T$, where $p_t$ denotes the marginal density of $\bold{X}_t$.
☝
Another notation in Yang Song's paper.

In that paper, the reverse-time SDE is written as:

$$d\bold{x} = [\bold{f}(\bold{x}, t) - g(t)^2 \nabla_{\bold{x}} \log p_t(\bold{x})] \, dt + g(t) \, d\bar{\bold{w}}$$

where time flows backwards from $T$ to $0$ with initialization $\bold{x}_T \sim p_T$. The two notations are equivalent; they differ only in the direction in which time flows.
The reverse-time Markov chain $\{Y_k\}_{k=0}^N$ associated with Equation (5) corresponds to an Euler–Maruyama discretization of Equation (7), where the score function is approximated by $s_{\theta^*}(t, x)$.
Let's consider $f(x) = -\alpha x$ for $\alpha \geq 0$. This framework includes the one of Song and Ermon (2019) ($\alpha = 0$, $p_{\text{prior}}(x) = \mathcal{N}(x; 0, 2T\bold{I})$), for which $(\bold{X}_t)_{t \in [0,T]}$ is simply a Brownian motion. It also includes Ho et al. (2020) ($\alpha > 0$, $p_{\text{prior}}(x) = \mathcal{N}(x; 0, \bold{I}/\alpha)$), for which it is an Ornstein–Uhlenbeck process.
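For $f(x) = -\alpha x$ the transition density of the forward SDE is available in closed form (a standard computation, stated here for reference), which also explains the two prior choices above:

\begin{align*}
\alpha > 0: \quad p_{t|0}(x_t | x_0) &= \mathcal{N}\left(x_t;\ e^{-\alpha t} x_0,\ \tfrac{1 - e^{-2\alpha t}}{\alpha} \bold{I}\right) \xrightarrow{\ t \to \infty\ } \mathcal{N}(x_t; 0, \bold{I}/\alpha), \\
\alpha = 0: \quad p_{t|0}(x_t | x_0) &= \mathcal{N}(x_t; x_0, 2t\,\bold{I}), \quad \text{so } p_T \approx \mathcal{N}(0, 2T\,\bold{I}) \text{ when } 2T \text{ is large relative to the data scale.}
\end{align*}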
3. General SGM and links with existing works (Appendix C.3 of the paper)
General SGM Algorithm

Consider the forward process
$$d\bold{X}_t = f_t(\bold{X}_t) \, dt + \sqrt{2} \, d\bold{B}_t$$

The discretization gives:

$$X_{k+1} = X_k + \gamma_{k+1} f_k(X_k) + \sqrt{2\gamma_{k+1}} Z_{k+1}$$

In general, $p_{k|0}(x_k | x_0)$ is not a Gaussian density. However, we can obtain that for any $x \in \mathbb{R}^d$,

$$p_{k+1}(x) = (4\pi\gamma_{k+1})^{-d/2} \int_{\mathbb{R}^d} p_k(\tilde{x}) \exp\left[ -\|\mathcal{T}_{k+1}(\tilde{x}) - x\|^2 / (4\gamma_{k+1}) \right] d\tilde{x}$$

with $\mathcal{T}_{k+1}(\tilde{x}) = \tilde{x} + \gamma_{k+1} f_k(\tilde{x})$. After some algebra, we obtain:
$$\nabla \log p_{k+1}(x) = -(2\gamma_{k+1})^{-1/2} \, \mathbb{E}[Z_{k+1} | X_{k+1} = x]$$

☝
How is that equation derived?
Denote $\tilde{Z}_{k+1} = \sqrt{2\gamma_{k+1}} Z_{k+1} \sim \mathcal{N}(0, 2\gamma_{k+1}\bold{I})$; then,
\begin{align*}
p_{k+1}(x) &= \int p_k(\tilde{x}) \, \mathcal{N}(x - \mathcal{T}_{k+1}(\tilde{x}); 0, 2\gamma_{k+1}\bold{I}) \, d\tilde{x} \\
&= \int p_k(\tilde{x}) \, (4\pi\gamma_{k+1})^{-d/2} \exp\left[ -\|\mathcal{T}_{k+1}(\tilde{x}) - x\|^2 / (4\gamma_{k+1}) \right] d\tilde{x}
\end{align*}

Taking the gradient with respect to $x$ on both sides and exchanging gradient and integration, we get:
\begin{align*}
\nabla p_{k+1}(x) &= \int p_k(\tilde{x}) \, \frac{\mathcal{T}_{k+1}(\tilde{x}) - x}{2\gamma_{k+1}} \, (4\pi\gamma_{k+1})^{-d/2} \exp\left[ -\|\mathcal{T}_{k+1}(\tilde{x}) - x\|^2 / (4\gamma_{k+1}) \right] d\tilde{x}
\end{align*}

Since $\nabla p_{k+1}(x) = p_{k+1}(x) \nabla \log p_{k+1}(x)$, we derive:
\begin{align*}
[2\gamma_{k+1} \, p_{k+1}(x)] \, \nabla \log p_{k+1}(x) &= \int p_k(\tilde{x}) \, [\mathcal{T}_{k+1}(\tilde{x}) - x] \, (4\pi\gamma_{k+1})^{-d/2} \exp\left[ -\|\mathcal{T}_{k+1}(\tilde{x}) - x\|^2 / (4\gamma_{k+1}) \right] d\tilde{x} \\
2\gamma_{k+1} \nabla \log p_{k+1}(x) &= \frac{\int p_k(\tilde{x}) \, [\mathcal{T}_{k+1}(\tilde{x}) - x] \exp\left[ -\|\mathcal{T}_{k+1}(\tilde{x}) - x\|^2 / (4\gamma_{k+1}) \right] d\tilde{x}}{\int p_k(\tilde{x}) \exp\left[ -\|\mathcal{T}_{k+1}(\tilde{x}) - x\|^2 / (4\gamma_{k+1}) \right] d\tilde{x}}
\end{align*}

Now we inspect the conditional density $p_{k+1|k}$. Note that:
\begin{align*}
p_{k+1|k}(x | \tilde{x}) &= p_{\tilde{Z}_{k+1}}(x - \mathcal{T}_{k+1}(\tilde{x})) \\
&= (4\pi\gamma_{k+1})^{-d/2} \exp\left[ -\|\mathcal{T}_{k+1}(\tilde{x}) - x\|^2 / (4\gamma_{k+1}) \right]
\end{align*}

Then, by Bayes' formula,
\begin{align*}
p_{k|k+1}(\tilde{x} | x) &= \frac{p_{k+1|k}(x | \tilde{x}) \, p_k(\tilde{x})}{p_{k+1}(x)} \\
&= \frac{1}{p_{k+1}(x)} \, p_k(\tilde{x}) \, (4\pi\gamma_{k+1})^{-d/2} \exp\left[ -\|\mathcal{T}_{k+1}(\tilde{x}) - x\|^2 / (4\gamma_{k+1}) \right]
\end{align*}

Therefore, the score function can be written as:
\begin{align*}
\nabla \log p_{k+1}(x) &= \frac{1}{2\gamma_{k+1}} \frac{\int [\mathcal{T}_{k+1}(\tilde{x}) - x] \, p_k(\tilde{x}) \exp\left[ -\|\mathcal{T}_{k+1}(\tilde{x}) - x\|^2 / (4\gamma_{k+1}) \right] / p_{k+1}(x) \, d\tilde{x}}{\int p_k(\tilde{x}) \exp\left[ -\|\mathcal{T}_{k+1}(\tilde{x}) - x\|^2 / (4\gamma_{k+1}) \right] / p_{k+1}(x) \, d\tilde{x}} \\
&= \frac{1}{2\gamma_{k+1}} \int [\mathcal{T}_{k+1}(\tilde{x}) - x] \, p_{k|k+1}(\tilde{x} | x) \, d\tilde{x} \\
&= \frac{1}{2\gamma_{k+1}} \, \mathbb{E}_{X_k \sim p_{k|k+1}}\left[ \mathcal{T}_{k+1}(X_k) - X_{k+1} \mid X_{k+1} = x \right] \\
&= \frac{1}{2\gamma_{k+1}} \, \mathbb{E}_{Z_{k+1} \sim \mathcal{N}(0, \bold{I})}\left[ -\sqrt{2\gamma_{k+1}} \, Z_{k+1} \mid X_{k+1} = x \right] \\
&= -(2\gamma_{k+1})^{-\frac{1}{2}} \, \mathbb{E}[Z_{k+1} | X_{k+1} = x]
\end{align*}
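This identity suggests estimating the score in the general case by regressing on the noise: a network trained with an L2 loss to predict $Z_{k+1}$ from $(k+1, X_{k+1})$ approximates the conditional expectation $\mathbb{E}[Z_{k+1} | X_{k+1}]$, and the score then follows from the formula above. A minimal sketch, where `noise_net`, the drift `f`, and the stepsizes `gammas` are placeholders:

```python
import torch

def general_sgm_loss(noise_net, x0, f, gammas):
    """Noise-prediction loss for the general forward chain (a minimal sketch).

    Simulates X_{k+1} = X_k + gamma_{k+1} * f_k(X_k) + sqrt(2 gamma_{k+1}) * Z_{k+1}
    from a data batch x0, picks a random step k, and regresses noise_net(k, X_k)
    onto Z_k; the L2-optimal predictor is E[Z_k | X_k], so the score can be
    recovered afterwards as s(k, x) = -(2 gamma_k)^(-1/2) * noise_net(k, x).
    `noise_net`, the drift `f(k, x)` and the stepsizes `gammas` are placeholders.
    """
    N = len(gammas) - 1
    xs, zs = [x0], [None]
    x = x0
    for k in range(1, N + 1):
        z = torch.randn_like(x)
        x = x + gammas[k] * f(k - 1, x) + (2 * gammas[k]) ** 0.5 * z
        xs.append(x)
        zs.append(z)
    k = int(torch.randint(1, N + 1, (1,)))           # random step, one per call
    return ((noise_net(k, xs[k]) - zs[k]) ** 2).sum(dim=-1).mean()
```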